Note: This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Description of Datasets
The starting datasets are the 94 protein sequence families of HomFam.
See Data Generation Workflow for steps to reproduce the datasets used in this analysis.
For each family and for each “number of sequences”, 10 datasets were generated.
These datasets can be downloaded from here.
ClustalO Standard Alignments
These alignments were generated with the following command:
clustalo --infile=${datasetID}.${size}.${rep}.fa \
--outfmt=fa \
--force \
-o ${datasetID}.${size}.${rep}.aln
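The file-name pattern in the command above implies one input file per family, size and replicate. As a rough sketch of that layout in R, the snippet below enumerates the expected input and output names. The family IDs and sizes are hypothetical placeholders; only the `${datasetID}.${size}.${rep}` pattern and the 10 replicates come from the text.

```r
# Hypothetical family IDs and sizes, chosen only for illustration;
# the real analysis uses the 94 HomFam families.
families <- c("famA", "famB")
sizes <- c(100L, 1000L)
reps <- 1:10  # 10 replicate datasets per family and size (from the text)

# One row per (replicate, size, family) combination
grid <- expand.grid(rep = reps, size = sizes, datasetID = families,
                    stringsAsFactors = FALSE)

# Mirror the ${datasetID}.${size}.${rep} pattern used by the clustalo command
grid$input <- sprintf("%s.%d.%d.fa", grid$datasetID, grid$size, grid$rep)
grid$output <- sub("\\.fa$", ".aln", grid$input)

nrow(grid)     # 2 families x 2 sizes x 10 replicates = 40
grid$input[1]  # "famA.100.1.fa"
```

With the full 94 families, the same grid would hold 94 x (number of sizes) x 10 rows.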
ClustalO Regressive Alignments
These alignments were generated with the following command:
t_coffee -dpa -dpa_method clustalo_msa \
-dpa_tree ${guide_tree} \
-seq ${seqs} \
-dpa_nseq ${bucket_size} \
-outfile ${id}.${size}.${rep}.dpa.${bucket_size}.${align_method}.with.${tree_method}.tree.aln
ClustalO Guide Trees
The guide trees for all datasets were generated using the following command:
clustalo -i ${seqs} --guidetree-out "${id}.${tree_method}.${size}.${rep}.dnd"
For each dataset, the same tree was used as both the DPA tree and the standard ClustalO guide tree.
Data Generation Workflow
The data used in this analysis was generated from a Nextflow workflow.
Nextflow is a framework that enables portable and reproducible workflows.
You can find the GitHub repository for the workflow here: https://github.com/skptic/embeded-analysis-nf/tree/master/templates
You can generate the data yourself with the following steps:
# Download Nextflow
wget -qO- https://get.nextflow.io | bash
# Run the example dataset
./nextflow run skptic/embeded-analysis-nf
R Data Analysis
- Prerequisites: Install and load packages
install.packages("plotly", repos="http://cran.rstudio.com/", dependencies=TRUE)
library(plotly)
- Step 1: Import the alignment datasets
clustalo_std_raw <- read.csv("~/Downloads/heatmap_data_clustalo_std.csv", row.names=1)
clustalo_reg_raw <- read.csv("~/Downloads/heatmap_data_clustalo_dpa.csv", row.names=1)
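To make the effect of `row.names=1` concrete, here is a minimal sketch that writes a toy CSV to a temporary file and reads it back. The layout (families as rows, sequence counts as columns) and all values are assumptions made for illustration; the real CSV files may be organised differently.

```r
# Made-up scores; the families-as-rows layout is an assumption
toy_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "family,100,1000",
  "famA,0.5,0.25",
  "famB,0.75,0.375"
), toy_csv)

# row.names = 1 turns the first column into row names instead of data
toy <- read.csv(toy_csv, row.names = 1)
rownames(toy)  # "famA" "famB"
colnames(toy)  # "X100" "X1000": non-syntactic names get an "X" prefix

# check.names = FALSE keeps the header exactly as written in the file
toy_keep <- read.csv(toy_csv, row.names = 1, check.names = FALSE)
colnames(toy_keep)  # "100" "1000"
```

If the column headers are numeric, `check.names = FALSE` may give cleaner axis labels in the heatmaps later on.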
- Step 2: Display the ClustalO Standard Alignment Dataset
clustalo_std_raw
- Step 3: Display the ClustalO Regressive Alignment Dataset
clustalo_reg_raw
- Step 4: Normalise both datasets by the first column and then transpose
# apply() over rows returns one column per input row, so t() restores the original orientation
clustalo_std_norm <- apply(clustalo_std_raw, 1, function(x) x / x[1])
clustalo_std_norm_t <- t(clustalo_std_norm)
clustalo_reg_norm <- apply(clustalo_reg_raw, 1, function(x) x / x[1])
clustalo_reg_norm_t <- t(clustalo_reg_norm)
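The normalisation step can be checked on a small example. The sketch below uses a toy data frame with made-up values (not taken from the real datasets) to show why the transpose is needed: `apply()` over rows returns each row's result as a column.

```r
# Toy stand-in for the real score table; all values are made up
toy_raw <- data.frame(
  n100  = c(0.5, 0.75),
  n1000 = c(0.25, 0.75),
  n5000 = c(0.125, 0.375),
  row.names = c("famA", "famB")
)

# Divide every row by its own first element;
# apply() over rows returns one *column* per input row
toy_norm <- apply(toy_raw, 1, function(x) x / x[1])
dim(toy_norm)  # 3 2

# t() restores the families-as-rows layout
toy_norm_t <- t(toy_norm)
toy_norm_t
#      n100 n1000 n5000
# famA    1   0.5  0.25
# famB    1   1.0  0.50
```

After normalisation the first column is always 1, so each remaining cell reads as a score relative to the first column. An alternative that avoids the transpose would be `sweep(as.matrix(toy_raw), 1, toy_raw[[1]], "/")`.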
- Step 5: Plot the ClustalO Standard Alignment Scores
plot_ly(x=colnames(clustalo_std_norm_t), y=rownames(clustalo_std_norm_t), z = clustalo_std_norm_t, type = "heatmap") %>% layout(yaxis = list(autorange = "reversed"))
- Step 6: Plot the ClustalO Regressive Alignment Scores
plot_ly(x=colnames(clustalo_reg_norm_t), y=rownames(clustalo_reg_norm_t), z = clustalo_reg_norm_t, type = "heatmap") %>% layout(yaxis = list(autorange = "reversed"))